Extraction of Translation Equivalents from Parallel Corpora

Author

  • Jörg Tiedemann
Abstract

In the past much effort was devoted to the compilation of multilingual parallel corpora for the purpose of linguistic information retrieval. This paper aims to introduce and evaluate three simple strategies for the extraction of translation equivalents from structured parallel texts. The goal is to support the production of bilingual dictionaries for domain-specific applications. The approaches described in the paper assume sentence alignment, strict translations, and historical relations between considered language pairs. They take advantage of corpus characteristics like short aligned units and structural & orthographic similarities in order to obtain results with a high level of precision. Furthermore, it will be shown that automatic filtering can be used to improve the precision of the extracted material. Simple techniques are used to detect translation candidates that are most likely wrong.
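As a rough illustration of how orthographic similarity between historically related languages might be used on short aligned units, the sketch below scores word pairs from aligned sentences with the longest common subsequence ratio (LCSR) and keeps pairs above a threshold. The abstract does not say which similarity measure the paper uses, so the choice of LCSR, the function names, the threshold of 0.7, and the minimum word length of 4 are all assumptions made for this example, not the author's method.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Longest common subsequence ratio, between 0.0 and 1.0."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def cognate_candidates(aligned_pairs, threshold=0.7, min_length=4):
    """Collect orthographically similar word pairs from aligned sentences.

    aligned_pairs: iterable of (source_sentence, target_sentence) strings.
    threshold and min_length are illustrative values, not taken from the paper.
    """
    candidates = set()
    for source_sentence, target_sentence in aligned_pairs:
        source_words = source_sentence.lower().split()
        target_words = target_sentence.lower().split()
        for s in source_words:
            if len(s) < min_length:
                continue  # very short words match by chance too easily
            for t in target_words:
                if len(t) < min_length:
                    continue
                if lcsr(s, t) >= threshold:
                    candidates.add((s, t))
    return candidates


# Hypothetical English-Swedish aligned unit, invented for this example.
pairs = [("the documentation is complete", "dokumentationen är komplett")]
result = cognate_candidates(pairs)
```

Raising the threshold trades recall for precision, which is one simple way to filter out candidates that are most likely wrong.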

Similar resources

Extracting Parallel Corpora from Comparable Documents to Improve Translation Quality in Machine Translation Systems

Data used for training statistical machine translation methods are usually prepared from three resources: parallel, non-parallel and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most existing methods for exploiting comparable corpora loo...

Measuring Comparability of Documents in Non-Parallel Corpora for Efficient Extraction of (Semi-)Parallel Translation Equivalents

In this paper we present and evaluate three approaches to measure comparability of documents in non-parallel corpora. We develop a task-oriented definition of comparability, based on the performance of automatic extraction of translation equivalents from the documents aligned by the proposed metrics, which formalises intuitive definitions of comparability for machine translation research. We de...

Extraction of Translation Equivalents from Parallel Corpora Using Sense-sensitive Contexts

The paper proposes an unsupervised method to extract translation equivalents from parallel corpora. The strategy we use takes into account the context of words. Given a word of the source language and a particular context, we learn its word translation within an equivalent context. We first extract pairs of similar contexts and, then, we compare the similarity between words appearing in each pa...

Extraction of Translation Equivalents from Non-Parallel Corpora

This paper presents a widely applicable method for extracting bilingual expressions from non-parallel corpora. The algorithm first collects word sequences as candidates for translation equivalents that match given patterns of word sequences from each corpus. Then, translation equivalents are selected from these candidates by aligning component words from within word sequences. We show the resul...

Learning Spanish-Galician Translation Equivalents Using a Comparable Corpus and a Bilingual Dictionary

So far, research on extraction of translation equivalents from comparable, non-parallel corpora has not been very popular. The main reason was the poor results when compared to those obtained from aligned parallel corpora. The method proposed in this paper, relying on seed patterns generated from external bilingual dictionaries, allows us to achieve similar results to those from parallel corpus...

Automatic Extraction of Translation Equivalents From Parallel Corpora

This paper presents a simple and effective method for extraction of translation equivalents from parallel corpora. Experiments were conducted on Orwell's "1984" parallel corpus with translations available in six CEE languages, all of them being aligned to the English original. Six bilingual lexicons X-English (En) were extracted, where X stands for one of Czech (Cz), Bulgarian (Bg), Eston...

Publication date: 1998